{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "",
   "id": "85c51d691f6f4cd4"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 随机森林模型训练与保存\n",
    "\n",
    "在本 Notebook 中，我们将：\n",
    "\n",
    "1. 加载预处理后的训练数据\n",
    "2. 使用随机森林算法训练模型（利用所有 CPU 核心）\n",
    "3. 保存训练好的模型\n",
    "\n",
    "本 Notebook 中的模型训练使用了数据集中全部的特征。\n"
   ],
   "id": "e4f078136fed252d"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 导入必要的库\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import joblib\n",
    "import os"
   ],
   "id": "a4fc0fa8a7a354d"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "在此单元格中，我们导入了必需的库：\n",
    "\n",
    "- `pandas` 用于数据处理\n",
    "- `RandomForestClassifier` 用于训练随机森林模型\n",
    "- `joblib` 用于保存和加载模型\n",
    "- `os` 用于操作文件和文件夹路径\n"
   ],
   "id": "fb419bae8c2856f5"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "initial_id",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 加载预处理后的数据集函数\n",
    "def load_preprocessed_data(data_dir):\n",
    "    X_train = pd.read_csv(os.path.join(data_dir, 'X_train.csv'))  # 读取训练特征数据\n",
    "    y_train = pd.read_csv(os.path.join(data_dir, 'y_train.csv')).values.ravel()  # 读取训练标签数据并扁平化\n",
    "    return X_train, y_train\n"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "这个函数 `load_preprocessed_data` 用于加载预处理后的训练数据集。我们从指定的 `data_dir` 目录中读取 `X_train.csv`（特征数据）和 `y_train.csv`（标签数据），并返回这两个数据集。\n",
    "\n",
    "- `X_train.csv` 包含训练集的特征。\n",
    "- `y_train.csv` 包含训练集的标签，并使用 `.values.ravel()` 将其转换为一维数组。\n"
   ],
   "id": "900264937f43e04b"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 训练随机森林模型的函数，使用所有 CPU 核心\n",
    "def train_random_forest(X_train, y_train):\n",
    "    rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)  # 使用所有 CPU 核心\n",
    "    rf_model.fit(X_train, y_train)  # 训练模型\n",
    "    return rf_model\n"
   ],
   "id": "3bcf1a6b698e6999"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "`train_random_forest` 函数用于训练一个随机森林模型。我们指定了以下参数：\n",
    "\n",
    "- `n_estimators=100`：随机森林中树的数量。\n",
    "- `random_state=42`：确保结果的可重复性。\n",
    "- `n_jobs=-1`：使用所有可用的 CPU 核心加速训练。\n",
    "\n",
    "训练完成后，该函数返回训练好的模型对象 `rf_model`。\n"
   ],
   "id": "facc68ca23007688"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 保存模型函数\n",
    "def save_model(model, output_dir, model_name='random_forest_model.pkl'):\n",
    "    os.makedirs(output_dir, exist_ok=True)  # 如果目录不存在则创建\n",
    "    model_path = os.path.join(output_dir, model_name)  # 设置模型保存路径\n",
    "    joblib.dump(model, model_path)  # 保存模型\n",
    "    print(f\"Model saved to {model_path}\")\n"
   ],
   "id": "9b8cf874fff0ff62"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "`save_model` 函数用于保存训练好的模型：\n",
    "\n",
    "1. 使用 `os.makedirs` 确保模型保存目录存在，如果不存在则创建。\n",
    "2. 定义模型的保存路径 `model_path`。\n",
    "3. 使用 `joblib.dump` 将模型保存为 `.pkl` 文件，便于后续加载和使用。\n",
    "4. 打印模型保存成功的路径。\n"
   ],
   "id": "f091d4b0ac56f2ce"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 主流程：加载数据、训练模型、保存模型\n",
    "if __name__ == \"__main__\":\n",
    "    # 指定数据目录和模型保存目录\n",
    "    data_dir = os.path.abspath('../data/processed')  # 数据目录\n",
    "    model_dir = os.path.abspath('../models')  # 模型保存目录\n",
    "\n",
    "    # 加载预处理后的数据\n",
    "    print(\"Loading preprocessed data...\")\n",
    "    X_train, y_train = load_preprocessed_data(data_dir)  # 调用函数加载训练数据\n",
    "\n",
    "    # 训练模型\n",
    "    print(\"Training Random Forest model with all CPU cores...\")\n",
    "    rf_model = train_random_forest(X_train, y_train)  # 调用函数训练模型\n",
    "\n",
    "    # 保存模型\n",
    "    print(\"Saving model...\")\n",
    "    save_model(rf_model, model_dir)  # 调用函数保存模型\n",
    "\n",
    "    print(\"Model training and saving completed successfully.\")\n"
   ],
   "id": "660f619b4243987e"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "主流程代码执行了以下步骤：\n",
    "\n",
    "1. 指定数据目录 `data_dir` 和模型保存目录 `model_dir`。\n",
    "2. 加载预处理后的训练数据，调用 `load_preprocessed_data` 函数。\n",
    "3. 训练随机森林模型，调用 `train_random_forest` 函数。\n",
    "4. 保存训练好的模型，调用 `save_model` 函数。\n",
    "5. 最后打印训练和保存过程完成的信息。\n",
    "\n",
    "通过这种方式，我们实现了整个模型训练和保存的流程。\n"
   ],
   "id": "b3e270999e789d80"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
